We explore the performance of text-davinci-003 and all Llama-2 models (both base and chat) on Experiments 1-3 from Degano et al. (2024), Experiments 1-2 from Marty et al. (2023), and Experiments 4-6 from Marty et al. (2022). All practice trials were used as a few-shot prompt (i.e., the correct solutions were presented).
The responses were obtained by retrieving the log probability of the labels “good” / “bad” following the prompt; the selected answer is the label with the higher probability. For reference, the prompts are linked here:
Below we analyse the selected responses. We apply two analyses: a winner-take-all (WTA) analysis, which takes the argmax over the two label probabilities, and a probability-based analysis, which uses the normalised probability assigned to each label.
In our other studies, the WTA approach has shown a better fit to human data. However, based on the plots below, the probability-based analysis might be better suited for these datasets.
process_response <- function(d) {
  d <- d %>%
    rowwise() %>%
    mutate(
      # log probability of the higher-probability label (the WTA choice)
      chosen_response_llh = max(Mean_logprob_answer_good, Mean_logprob_answer_bad),
      chosen_response = ifelse(chosen_response_llh == Mean_logprob_answer_good,
                               "Mean_logprob_answer_good", "Mean_logprob_answer_bad"),
      # keep only the label itself ("good" / "bad")
      chosen_response = str_split(chosen_response, "_", simplify = TRUE)[, 4],
      # renormalise the two label probabilities so they sum to 1
      norm_factor = sum(exp(Mean_logprob_answer_good), exp(Mean_logprob_answer_bad)),
      prob_good = exp(Mean_logprob_answer_good) / norm_factor,
      prob_bad = exp(Mean_logprob_answer_bad) / norm_factor
    )
  return(d)
}
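As a minimal sanity check (assuming the tidyverse is already loaded, as elsewhere in this report), a toy one-row input with hypothetical log probabilities illustrates both quantities:

```r
library(dplyr)
library(stringr)

# Hypothetical log probabilities for a single trial (not real model output)
toy <- tibble(
  Mean_logprob_answer_good = log(0.6),
  Mean_logprob_answer_bad  = log(0.2)
)

process_response(toy) %>%
  select(chosen_response, prob_good, prob_bad)
# chosen_response = "good" (WTA); prob_good = 0.6 / 0.8 = 0.75; prob_bad = 0.25
```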
We apply these analyses by study and then group the resulting statistics by the conditions of each respective study.
degano2024_processed <- process_response(degano2024)
marty2023_processed <- process_response(marty2023)
marty2022_processed <- process_response(marty2022)
degano2024_acc_rate <- degano2024_processed %>%
  mutate(
    is_good = as.numeric(chosen_response == "good")
  ) %>%
  group_by(model, Condition, Experiment) %>%
  tidyboot_mean(column = is_good)
degano2024_acc_rate
## # A tibble: 105 × 8
## # Groups: model, Condition [35]
## model Condition Experiment n empiri…¹ ci_lo…² mean ci_up…³
## <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Llama-2-13b-chat-hf Bad Exp_1 18 0 0 0 0
## 2 Llama-2-13b-chat-hf Bad Exp_2 18 0 0 0 0
## 3 Llama-2-13b-chat-hf Bad Exp_3 18 0 0 0 0
## 4 Llama-2-13b-chat-hf Good Exp_1 15 0 0 0 0
## 5 Llama-2-13b-chat-hf Good Exp_2 15 0 0 0 0
## 6 Llama-2-13b-chat-hf Good Exp_3 15 0 0 0 0
## 7 Llama-2-13b-chat-hf Good-Excl Exp_1 3 0 0 0 0
## 8 Llama-2-13b-chat-hf Good-Excl Exp_2 3 0 0 0 0
## 9 Llama-2-13b-chat-hf Good-Excl Exp_3 3 0 0 0 0
## 10 Llama-2-13b-chat-hf Target_1 Exp_1 6 0 0 0 0
## # … with 95 more rows, and abbreviated variable names ¹empirical_stat,
## # ²ci_lower, ³ci_upper
marty2023_acc_rate <- marty2023_processed %>%
  mutate(
    is_good = as.numeric(chosen_response == "good")
  ) %>%
  group_by(model, Quantifier_type, Condition, Experiment) %>%
  tidyboot_mean(column = is_good)
marty2023_acc_rate
## # A tibble: 112 × 9
## # Groups: model, Quantifier_type, Condition [56]
## model Quant…¹ Condi…² Exper…³ n empir…⁴ ci_lo…⁵ mean ci_up…⁶
## <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Llama-2-13b-chat… Modal Bad Exp_1 18 0 0 0 0
## 2 Llama-2-13b-chat… Modal Bad Exp_2 18 0 0 0 0
## 3 Llama-2-13b-chat… Modal Good Exp_1 18 0 0 0 0
## 4 Llama-2-13b-chat… Modal Good Exp_2 18 0 0 0 0
## 5 Llama-2-13b-chat… Modal Target… Exp_1 6 0 0 0 0
## 6 Llama-2-13b-chat… Modal Target… Exp_2 6 0 0 0 0
## 7 Llama-2-13b-chat… Modal Target… Exp_1 6 0 0 0 0
## 8 Llama-2-13b-chat… Modal Target… Exp_2 6 0 0 0 0
## 9 Llama-2-13b-chat… Nominal Bad Exp_1 18 0 0 0 0
## 10 Llama-2-13b-chat… Nominal Bad Exp_2 18 0 0 0 0
## # … with 102 more rows, and abbreviated variable names ¹Quantifier_type,
## # ²Condition, ³Experiment, ⁴empirical_stat, ⁵ci_lower, ⁶ci_upper
marty2022_acc_rate <- marty2022_processed %>%
  mutate(
    is_good = as.numeric(chosen_response == "good")
  ) %>%
  group_by(model, Negation, Polarity, Inference_type, Condition, Experiment) %>%
  tidyboot_mean(column = is_good)
marty2022_acc_rate
## # A tibble: 504 × 11
## # Groups: model, Negation, Polarity, Inference_type, Condition [504]
## model Negat…¹ Polar…² Infer…³ Condi…⁴ Exper…⁵ n empir…⁶ ci_lo…⁷ mean
## <chr> <chr> <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Llama-2-… High Negati… DIST Bad Exp_4 3 1 1 1
## 2 Llama-2-… High Negati… DIST Good Exp_4 3 1 1 1
## 3 Llama-2-… High Negati… DIST Target Exp_4 3 0.667 0 0.675
## 4 Llama-2-… High Negati… FC Bad Exp_4 3 1 1 1
## 5 Llama-2-… High Negati… FC Good Exp_4 3 0.667 0 0.658
## 6 Llama-2-… High Negati… FC Target Exp_4 3 0.667 0 0.686
## 7 Llama-2-… High Negati… II Bad Exp_4 3 1 1 1
## 8 Llama-2-… High Negati… II Good Exp_4 3 1 1 1
## 9 Llama-2-… High Negati… II Target Exp_4 3 0.667 0 0.667
## 10 Llama-2-… High Negati… SI Bad Exp_4 3 0.667 0 0.670
## # … with 494 more rows, 1 more variable: ci_upper <dbl>, and abbreviated
## # variable names ¹Negation, ²Polarity, ³Inference_type, ⁴Condition,
## # ⁵Experiment, ⁶empirical_stat, ⁷ci_lower
The processed data are saved alongside the raw materials’ CSVs, augmented with the model results and with a column recording which model produced them. The new columns “prob_good” and “prob_bad” contain the trial-level (normalised) probability of each option, while “chosen_response” contains the response chosen via the argmax over prob_good and prob_bad (i.e., the WTA strategy).
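The saving step can be sketched as follows; the output paths are hypothetical and should be adjusted to the project layout:

```r
library(readr)

# Hypothetical output paths — not the project's actual file names
write_csv(degano2024_processed, "results/degano2024_processed.csv")
write_csv(marty2023_processed,  "results/marty2023_processed.csv")
write_csv(marty2022_processed,  "results/marty2022_processed.csv")
```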
Below, we plot the mean acceptance rate (i.e., the mean proportion of judgments that a trigger sentence is good) by model, experiment, and condition.
Plot for Degano et al. (2024):
degano2024_acc_rate %>%
  ggplot(., aes(x = Condition, y = mean, ymin = ci_lower, ymax = ci_upper)) +
  geom_col() +
  geom_errorbar(width = 0.1) +
  facet_wrap(model ~ Experiment, ncol = 3) +
  ylab("Acceptance rate") +
  theme_csp() +
  theme(axis.text.x = element_text(angle = 30))
Marty et al. (2023):
marty2023_acc_rate %>%
  ggplot(., aes(x = Condition, y = mean, fill = Quantifier_type, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(model ~ Experiment) +
  ylab("Acceptance rate") +
  theme_csp() +
  theme(axis.text.x = element_text(angle = 30))
Marty et al. (2022):
# by experiment
e4 <- marty2022_acc_rate %>%
  filter(Experiment == "Exp_4") %>%
  ggplot(., aes(x = Condition, y = mean, fill = Polarity, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(Inference_type ~ model, nrow = 4) +
  ylab("Acceptance rate") +
  theme_csp() +
  theme(axis.text.x = element_text(angle = 30))

e5 <- marty2022_acc_rate %>%
  filter(Experiment == "Exp_5") %>%
  ggplot(., aes(x = Condition, y = mean, fill = Polarity, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(Inference_type ~ model, nrow = 4) +
  ylab("Acceptance rate") +
  theme_csp() +
  theme(axis.text.x = element_text(angle = 30))

e6 <- marty2022_acc_rate %>%
  filter(Experiment == "Exp_6") %>%
  ggplot(., aes(x = Condition, y = mean, fill = Polarity, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(Inference_type ~ model, nrow = 4) +
  ylab("Acceptance rate") +
  theme_csp() +
  theme(axis.text.x = element_text(angle = 30))

# bind the plots together
gridExtra::grid.arrange(e4, e5, e6)
The same plots are now replicated using the average probability assigned to the “acceptance” (i.e., “good”) response.
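The summarisation code producing these tables is not echoed in this extract. A plausible reconstruction, mirroring the acceptance-rate pipeline but bootstrapping the trial-level prob_good instead of the binary WTA choice (the variable names follow the plotting code further down), would be:

```r
# Assumed reconstruction — not echoed in the original report
degano2024_prob <- degano2024_processed %>%
  group_by(model, Experiment, Condition) %>%
  tidyboot_mean(column = prob_good)

marty2023_prob <- marty2023_processed %>%
  group_by(model, Quantifier_type, Condition, Experiment) %>%
  tidyboot_mean(column = prob_good)

marty2022_prob <- marty2022_processed %>%
  group_by(model, Negation, Polarity, Inference_type, Condition, Experiment) %>%
  tidyboot_mean(column = prob_good)
```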
## # A tibble: 105 × 8
## # Groups: model, Experiment [21]
## model Experiment Condition n empiri…¹ ci_lo…² mean ci_up…³
## <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Llama-2-13b-chat-hf Exp_1 Bad 18 0.235 0.212 0.235 0.257
## 2 Llama-2-13b-chat-hf Exp_1 Good 15 0.219 0.201 0.219 0.237
## 3 Llama-2-13b-chat-hf Exp_1 Good-Excl 3 0.213 0.198 0.213 0.222
## 4 Llama-2-13b-chat-hf Exp_1 Target_1 6 0.250 0.221 0.251 0.287
## 5 Llama-2-13b-chat-hf Exp_1 Target_2 6 0.209 0.174 0.209 0.252
## 6 Llama-2-13b-chat-hf Exp_2 Bad 18 0.239 0.221 0.239 0.257
## 7 Llama-2-13b-chat-hf Exp_2 Good 15 0.230 0.213 0.230 0.246
## 8 Llama-2-13b-chat-hf Exp_2 Good-Excl 3 0.215 0.179 0.215 0.257
## 9 Llama-2-13b-chat-hf Exp_2 Target_1 6 0.230 0.199 0.229 0.259
## 10 Llama-2-13b-chat-hf Exp_2 Target_2 6 0.222 0.185 0.222 0.258
## # … with 95 more rows, and abbreviated variable names ¹empirical_stat,
## # ²ci_lower, ³ci_upper
## # A tibble: 112 × 9
## # Groups: model, Quantifier_type, Condition [56]
## model Quant…¹ Condi…² Exper…³ n empir…⁴ ci_lo…⁵ mean ci_up…⁶
## <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Llama-2-13b-chat… Modal Bad Exp_1 18 0.226 0.199 0.226 0.252
## 2 Llama-2-13b-chat… Modal Bad Exp_2 18 0.284 0.259 0.285 0.312
## 3 Llama-2-13b-chat… Modal Good Exp_1 18 0.224 0.197 0.225 0.253
## 4 Llama-2-13b-chat… Modal Good Exp_2 18 0.251 0.223 0.251 0.279
## 5 Llama-2-13b-chat… Modal Target… Exp_1 6 0.248 0.193 0.246 0.296
## 6 Llama-2-13b-chat… Modal Target… Exp_2 6 0.287 0.222 0.287 0.348
## 7 Llama-2-13b-chat… Modal Target… Exp_1 6 0.222 0.186 0.223 0.268
## 8 Llama-2-13b-chat… Modal Target… Exp_2 6 0.256 0.228 0.255 0.278
## 9 Llama-2-13b-chat… Nominal Bad Exp_1 18 0.227 0.197 0.227 0.259
## 10 Llama-2-13b-chat… Nominal Bad Exp_2 18 0.266 0.235 0.266 0.301
## # … with 102 more rows, and abbreviated variable names ¹Quantifier_type,
## # ²Condition, ³Experiment, ⁴empirical_stat, ⁵ci_lower, ⁶ci_upper
## # A tibble: 504 × 11
## # Groups: model, Negation, Polarity, Inference_type, Condition [504]
## model Negat…¹ Polar…² Infer…³ Condi…⁴ Exper…⁵ n empir…⁶ ci_lo…⁷ mean
## <chr> <chr> <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Llama-2-… High Negati… DIST Bad Exp_4 3 0.594 0.526 0.594
## 2 Llama-2-… High Negati… DIST Good Exp_4 3 0.580 0.537 0.582
## 3 Llama-2-… High Negati… DIST Target Exp_4 3 0.568 0.438 0.565
## 4 Llama-2-… High Negati… FC Bad Exp_4 3 0.575 0.551 0.576
## 5 Llama-2-… High Negati… FC Good Exp_4 3 0.567 0.436 0.571
## 6 Llama-2-… High Negati… FC Target Exp_4 3 0.568 0.484 0.566
## 7 Llama-2-… High Negati… II Bad Exp_4 3 0.530 0.5 0.531
## 8 Llama-2-… High Negati… II Good Exp_4 3 0.634 0.613 0.635
## 9 Llama-2-… High Negati… II Target Exp_4 3 0.539 0.481 0.538
## 10 Llama-2-… High Negati… SI Bad Exp_4 3 0.528 0.423 0.528
## # … with 494 more rows, 1 more variable: ci_upper <dbl>, and abbreviated
## # variable names ¹Negation, ²Polarity, ³Inference_type, ⁴Condition,
## # ⁵Experiment, ⁶empirical_stat, ⁷ci_lower
degano2024_prob %>%
  ggplot(., aes(x = Condition, y = mean, ymin = ci_lower, ymax = ci_upper)) +
  geom_col() +
  geom_errorbar(width = 0.1) +
  facet_wrap(Experiment ~ model) +
  ylab("Probability of acceptance") +
  theme_csp() +
  theme(axis.text.x = element_text(angle = 30))
marty2023_prob %>%
  ggplot(., aes(x = Condition, y = mean, fill = Quantifier_type, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(Experiment ~ model) +
  ylab("Probability of acceptance") +
  theme_csp() +
  theme(axis.text.x = element_text(angle = 30))
e4_prob <- marty2022_prob %>%
  filter(Experiment == "Exp_4") %>%
  ggplot(., aes(x = Condition, y = mean, fill = Polarity, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(Inference_type ~ model, nrow = 4) +
  ylab("Probability of acceptance") +
  theme_csp() +
  theme(axis.text.x = element_text(angle = 30))

e5_prob <- marty2022_prob %>%
  filter(Experiment == "Exp_5") %>%
  ggplot(., aes(x = Condition, y = mean, fill = Polarity, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(Inference_type ~ model, nrow = 4) +
  ylab("Probability of acceptance") +
  theme_csp() +
  theme(axis.text.x = element_text(angle = 30))

e6_prob <- marty2022_prob %>%
  filter(Experiment == "Exp_6") %>%
  ggplot(., aes(x = Condition, y = mean, fill = Polarity, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(Inference_type ~ model, nrow = 4) +
  ylab("Probability of acceptance") +
  theme_csp() +
  theme(axis.text.x = element_text(angle = 30))

# bind the plots together
gridExtra::grid.arrange(e4_prob, e5_prob, e6_prob)
We also conducted an exploratory zero-shot evaluation (with Llama-2 7b base, 7b chat, 13b base, and 13b chat). The WTA results and the trial-level probability results are shown below.
For Degano et al. (2024), as an example for direct comparison, we combine the zero-shot with the few-shot results:
zero_shot_acc_rate_prompts <- zero_shot_degano_acc_rate %>%
  mutate(prompting = "zero-shot") %>%
  rbind(.,
        degano2024_acc_rate %>%
          # filter(model == "Llama-2-13b-chat-hf") %>%
          mutate(prompting = "few-shot"))

zero_shot_prob_prompts <- zero_shot_degano_prob %>%
  mutate(prompting = "zero-shot") %>%
  rbind(.,
        degano2024_prob %>%
          # filter(model == "Llama-2-13b-chat-hf") %>%
          mutate(prompting = "few-shot"))
Average probability:
zero_shot_prob_prompts %>%
  ggplot(., aes(x = Condition, y = mean, ymin = ci_lower, ymax = ci_upper, fill = prompting)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(width = 0.1, position = position_dodge(0.95)) +
  facet_wrap(Experiment ~ model) +
  ylab("Probability of acceptance") +
  theme_csp() +
  theme(axis.text.x = element_text(angle = 30))
WTA:
zero_shot_acc_rate_prompts %>%
  ggplot(., aes(x = Condition, y = mean, ymin = ci_lower, ymax = ci_upper, fill = prompting)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(width = 0.1, position = position_dodge(0.95)) +
  facet_wrap(Experiment ~ model) +
  ylab("Acceptance rate") +
  theme_csp() +
  theme(axis.text.x = element_text(angle = 30))
Now we replicate the overview plots from above on the zero-shot data.
WTA:
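The zero-shot overview plot itself is not reproduced in this extract. Assuming a zero-shot WTA summary named and structured analogously to the few-shot one (a hypothetical reconstruction, not the report's verbatim code), it could be generated as:

```r
# Hypothetical: assumes zero_shot_degano_acc_rate has the same
# columns as the few-shot summary (mean, ci_lower, ci_upper, ...)
zero_shot_degano_acc_rate %>%
  ggplot(., aes(x = Condition, y = mean, ymin = ci_lower, ymax = ci_upper)) +
  geom_col() +
  geom_errorbar(width = 0.1) +
  facet_wrap(Experiment ~ model) +
  ylab("Acceptance rate") +
  theme_csp() +
  theme(axis.text.x = element_text(angle = 30))
```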
Item-level probability (wide-scope):